About Dataset
Description: Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area.
The key challenge in its detection is classifying tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.
Acknowledgements: This dataset is taken from Kaggle. Link: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset
Problem Statement: Breast Cancer Classification using Support Vector Machine (SVM)
Breast cancer is one of the most common and life-threatening diseases affecting women worldwide. Early diagnosis significantly improves the chances of successful treatment and recovery. However, manual diagnosis based on biopsy or imaging can be time-consuming and subject to human error.
This project aims to develop a binary classification model using Support Vector Machine (SVM) to automatically distinguish between malignant and benign breast tumors using the Breast Cancer Wisconsin dataset. The goal is to create a reliable, efficient, and accurate model that can assist healthcare professionals in decision-making.
Objectives:
Load and preprocess the Breast Cancer dataset to make it suitable for binary classification.
Train SVM classifiers using both linear and RBF kernels.
Visualize the decision boundaries using a 2D projection of the data.
Tune hyperparameters such as C and gamma using GridSearchCV for better performance.
Evaluate model performance using cross-validation, a confusion matrix, and a classification report.
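The objectives above can be sketched end to end as follows. This is a minimal sketch, not the notebook's own code: it assumes scikit-learn's bundled copy of the Wisconsin (Diagnostic) data (`load_breast_cancer`) in place of the Kaggle CSV, and the split ratio, random seed, and parameter grid are illustrative choices. In the bundled copy the target is encoded 0 = malignant, 1 = benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy of the dataset stands in for the CSV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Compare linear and RBF kernels with 5-fold cross-validation on the training set.
# Scaling sits inside the pipeline, so each fold is scaled from its own training part.
cv_scores = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    cv_scores[kernel] = cross_val_score(model, X_train, y_train, cv=5).mean()

# Tune C and gamma for the RBF kernel (illustrative grid, not a recommendation).
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")), param_grid, cv=5
)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

print(cv_scores, search.best_params_, round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline is a deliberate choice here: it prevents statistics from the validation folds or the test set leaking into training.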
Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
Import the Breast Cancer dataset
breast_dataset = pd.read_csv("breast-cancer.csv")
print("Top 10 rows :\n")
breast_dataset.head(10)
Top 10 rows :
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| 5 | 843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.15780 | 0.08089 | ... | 15.47 | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 |
| 6 | 844359 | M | 18.25 | 19.98 | 119.60 | 1040.0 | 0.09463 | 0.10900 | 0.11270 | 0.07400 | ... | 22.88 | 27.66 | 153.20 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 |
| 7 | 84458202 | M | 13.71 | 20.83 | 90.20 | 577.9 | 0.11890 | 0.16450 | 0.09366 | 0.05985 | ... | 17.06 | 28.14 | 110.60 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.11510 |
| 8 | 844981 | M | 13.00 | 21.82 | 87.50 | 519.8 | 0.12730 | 0.19320 | 0.18590 | 0.09353 | ... | 15.49 | 30.73 | 106.20 | 739.3 | 0.1703 | 0.5401 | 0.5390 | 0.2060 | 0.4378 | 0.10720 |
| 9 | 84501001 | M | 12.46 | 24.04 | 83.97 | 475.9 | 0.11860 | 0.23960 | 0.22730 | 0.08543 | ... | 15.09 | 40.68 | 97.65 | 711.4 | 0.1853 | 1.0580 | 1.1050 | 0.2210 | 0.4366 | 0.20750 |
10 rows × 32 columns
Total_Rows = breast_dataset.shape[0]
Total_Cols = breast_dataset.shape[1]
print("Total Rows is :",Total_Rows)
print("Total columns is :", Total_Cols)
Total Rows is : 569
Total columns is : 32
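Before training, the identifier column has to be dropped and the `diagnosis` labels need a numeric encoding. A minimal sketch on a tiny hand-built frame: the first two rows copy values from the table above, while the third (benign) row is made up purely for illustration, and the mapping M → 1, B → 0 is an assumed convention.

```python
import pandas as pd

# Toy frame: rows 1-2 copy values from the head() output;
# row 3 is a made-up benign example, not real data.
toy = pd.DataFrame({
    "id": [842302, 842517, 1],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 12.00],
    "texture_mean": [10.38, 17.77, 15.00],
})

# The id is an identifier, not a measurement, so it should not be a feature.
X = toy.drop(columns=["id", "diagnosis"])
# Assumed convention: malignant = 1 (positive class), benign = 0.
y = toy["diagnosis"].map({"M": 1, "B": 0})

print(X.columns.tolist())
print(y.tolist())
```

`LabelEncoder`, which the notebook imports, sorts class labels alphabetically and so produces the same B → 0, M → 1 encoding.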
print("\n Information about the Breast Cancer dataset : \n ")
breast_dataset.info()
 Information about the Breast Cancer dataset :

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB
Column distribution
feature_cols = breast_dataset.columns.tolist()
print("Total columns is here :\n", breast_dataset.columns.tolist())
Total columns is here : ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
numerical_col = breast_dataset.select_dtypes(include=["int64","float64"]).columns
print("Total Numerical columns list is here :\n", numerical_col)
# len() counts the columns directly; note that id is numeric but is an identifier, not a feature.
print("\nTotal Numerical columns is here :\n", len(numerical_col))
Total Numerical columns list is here :
Index(['id', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')
Total Numerical columns is here :
31
categorical_col = breast_dataset.select_dtypes(include=["O"]).columns
print("Total categorical columns list is here :\n", categorical_col)
print("\nTotal categorical columns is here :\n", len(categorical_col))
Total categorical columns list is here :
 Index(['diagnosis'], dtype='object')

Total categorical columns is here :
 1
Basic statistics about the breast cancer data:
breast_dataset.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 569.0 | 3.037183e+07 | 1.250206e+08 | 8670.000000 | 869218.000000 | 906024.000000 | 8.813129e+06 | 9.113205e+08 |
| radius_mean | 569.0 | 1.412729e+01 | 3.524049e+00 | 6.981000 | 11.700000 | 13.370000 | 1.578000e+01 | 2.811000e+01 |
| texture_mean | 569.0 | 1.928965e+01 | 4.301036e+00 | 9.710000 | 16.170000 | 18.840000 | 2.180000e+01 | 3.928000e+01 |
| perimeter_mean | 569.0 | 9.196903e+01 | 2.429898e+01 | 43.790000 | 75.170000 | 86.240000 | 1.041000e+02 | 1.885000e+02 |
| area_mean | 569.0 | 6.548891e+02 | 3.519141e+02 | 143.500000 | 420.300000 | 551.100000 | 7.827000e+02 | 2.501000e+03 |
| smoothness_mean | 569.0 | 9.636028e-02 | 1.406413e-02 | 0.052630 | 0.086370 | 0.095870 | 1.053000e-01 | 1.634000e-01 |
| compactness_mean | 569.0 | 1.043410e-01 | 5.281276e-02 | 0.019380 | 0.064920 | 0.092630 | 1.304000e-01 | 3.454000e-01 |
| concavity_mean | 569.0 | 8.879932e-02 | 7.971981e-02 | 0.000000 | 0.029560 | 0.061540 | 1.307000e-01 | 4.268000e-01 |
| concave points_mean | 569.0 | 4.891915e-02 | 3.880284e-02 | 0.000000 | 0.020310 | 0.033500 | 7.400000e-02 | 2.012000e-01 |
| symmetry_mean | 569.0 | 1.811619e-01 | 2.741428e-02 | 0.106000 | 0.161900 | 0.179200 | 1.957000e-01 | 3.040000e-01 |
| fractal_dimension_mean | 569.0 | 6.279761e-02 | 7.060363e-03 | 0.049960 | 0.057700 | 0.061540 | 6.612000e-02 | 9.744000e-02 |
| radius_se | 569.0 | 4.051721e-01 | 2.773127e-01 | 0.111500 | 0.232400 | 0.324200 | 4.789000e-01 | 2.873000e+00 |
| texture_se | 569.0 | 1.216853e+00 | 5.516484e-01 | 0.360200 | 0.833900 | 1.108000 | 1.474000e+00 | 4.885000e+00 |
| perimeter_se | 569.0 | 2.866059e+00 | 2.021855e+00 | 0.757000 | 1.606000 | 2.287000 | 3.357000e+00 | 2.198000e+01 |
| area_se | 569.0 | 4.033708e+01 | 4.549101e+01 | 6.802000 | 17.850000 | 24.530000 | 4.519000e+01 | 5.422000e+02 |
| smoothness_se | 569.0 | 7.040979e-03 | 3.002518e-03 | 0.001713 | 0.005169 | 0.006380 | 8.146000e-03 | 3.113000e-02 |
| compactness_se | 569.0 | 2.547814e-02 | 1.790818e-02 | 0.002252 | 0.013080 | 0.020450 | 3.245000e-02 | 1.354000e-01 |
| concavity_se | 569.0 | 3.189372e-02 | 3.018606e-02 | 0.000000 | 0.015090 | 0.025890 | 4.205000e-02 | 3.960000e-01 |
| concave points_se | 569.0 | 1.179614e-02 | 6.170285e-03 | 0.000000 | 0.007638 | 0.010930 | 1.471000e-02 | 5.279000e-02 |
| symmetry_se | 569.0 | 2.054230e-02 | 8.266372e-03 | 0.007882 | 0.015160 | 0.018730 | 2.348000e-02 | 7.895000e-02 |
| fractal_dimension_se | 569.0 | 3.794904e-03 | 2.646071e-03 | 0.000895 | 0.002248 | 0.003187 | 4.558000e-03 | 2.984000e-02 |
| radius_worst | 569.0 | 1.626919e+01 | 4.833242e+00 | 7.930000 | 13.010000 | 14.970000 | 1.879000e+01 | 3.604000e+01 |
| texture_worst | 569.0 | 2.567722e+01 | 6.146258e+00 | 12.020000 | 21.080000 | 25.410000 | 2.972000e+01 | 4.954000e+01 |
| perimeter_worst | 569.0 | 1.072612e+02 | 3.360254e+01 | 50.410000 | 84.110000 | 97.660000 | 1.254000e+02 | 2.512000e+02 |
| area_worst | 569.0 | 8.805831e+02 | 5.693570e+02 | 185.200000 | 515.300000 | 686.500000 | 1.084000e+03 | 4.254000e+03 |
| smoothness_worst | 569.0 | 1.323686e-01 | 2.283243e-02 | 0.071170 | 0.116600 | 0.131300 | 1.460000e-01 | 2.226000e-01 |
| compactness_worst | 569.0 | 2.542650e-01 | 1.573365e-01 | 0.027290 | 0.147200 | 0.211900 | 3.391000e-01 | 1.058000e+00 |
| concavity_worst | 569.0 | 2.721885e-01 | 2.086243e-01 | 0.000000 | 0.114500 | 0.226700 | 3.829000e-01 | 1.252000e+00 |
| concave points_worst | 569.0 | 1.146062e-01 | 6.573234e-02 | 0.000000 | 0.064930 | 0.099930 | 1.614000e-01 | 2.910000e-01 |
| symmetry_worst | 569.0 | 2.900756e-01 | 6.186747e-02 | 0.156500 | 0.250400 | 0.282200 | 3.179000e-01 | 6.638000e-01 |
| fractal_dimension_worst | 569.0 | 8.394582e-02 | 1.806127e-02 | 0.055040 | 0.071460 | 0.080040 | 9.208000e-02 | 2.075000e-01 |
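The table shows features on very different scales: `area_mean` has a standard deviation of about 352, while `smoothness_mean` has one of about 0.014. An SVM with an RBF kernel depends on distances between samples, so large-valued features tend to dominate unless the data is standardized. A small sketch with `StandardScaler`, using the `area_mean` and `smoothness_mean` values of the first five rows shown earlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# area_mean and smoothness_mean for the first five rows of the dataset.
X = np.array([
    [1001.0, 0.11840],
    [1326.0, 0.08474],
    [1203.0, 0.10960],
    [386.1, 0.14250],
    [1297.0, 0.10030],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and unit (population) standard deviation.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split.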
breast_dataset.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 569.0 | NaN | NaN | NaN | 30371831.432337 | 125020585.612224 | 8670.0 | 869218.0 | 906024.0 | 8813129.0 | 911320502.0 |
| diagnosis | 569 | 2 | B | 357 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| radius_mean | 569.0 | NaN | NaN | NaN | 14.127292 | 3.524049 | 6.981 | 11.7 | 13.37 | 15.78 | 28.11 |
| texture_mean | 569.0 | NaN | NaN | NaN | 19.289649 | 4.301036 | 9.71 | 16.17 | 18.84 | 21.8 | 39.28 |
| perimeter_mean | 569.0 | NaN | NaN | NaN | 91.969033 | 24.298981 | 43.79 | 75.17 | 86.24 | 104.1 | 188.5 |
| area_mean | 569.0 | NaN | NaN | NaN | 654.889104 | 351.914129 | 143.5 | 420.3 | 551.1 | 782.7 | 2501.0 |
| smoothness_mean | 569.0 | NaN | NaN | NaN | 0.09636 | 0.014064 | 0.05263 | 0.08637 | 0.09587 | 0.1053 | 0.1634 |
| compactness_mean | 569.0 | NaN | NaN | NaN | 0.104341 | 0.052813 | 0.01938 | 0.06492 | 0.09263 | 0.1304 | 0.3454 |
| concavity_mean | 569.0 | NaN | NaN | NaN | 0.088799 | 0.07972 | 0.0 | 0.02956 | 0.06154 | 0.1307 | 0.4268 |
| concave points_mean | 569.0 | NaN | NaN | NaN | 0.048919 | 0.038803 | 0.0 | 0.02031 | 0.0335 | 0.074 | 0.2012 |
| symmetry_mean | 569.0 | NaN | NaN | NaN | 0.181162 | 0.027414 | 0.106 | 0.1619 | 0.1792 | 0.1957 | 0.304 |
| fractal_dimension_mean | 569.0 | NaN | NaN | NaN | 0.062798 | 0.00706 | 0.04996 | 0.0577 | 0.06154 | 0.06612 | 0.09744 |
| radius_se | 569.0 | NaN | NaN | NaN | 0.405172 | 0.277313 | 0.1115 | 0.2324 | 0.3242 | 0.4789 | 2.873 |
| texture_se | 569.0 | NaN | NaN | NaN | 1.216853 | 0.551648 | 0.3602 | 0.8339 | 1.108 | 1.474 | 4.885 |
| perimeter_se | 569.0 | NaN | NaN | NaN | 2.866059 | 2.021855 | 0.757 | 1.606 | 2.287 | 3.357 | 21.98 |
| area_se | 569.0 | NaN | NaN | NaN | 40.337079 | 45.491006 | 6.802 | 17.85 | 24.53 | 45.19 | 542.2 |
| smoothness_se | 569.0 | NaN | NaN | NaN | 0.007041 | 0.003003 | 0.001713 | 0.005169 | 0.00638 | 0.008146 | 0.03113 |
| compactness_se | 569.0 | NaN | NaN | NaN | 0.025478 | 0.017908 | 0.002252 | 0.01308 | 0.02045 | 0.03245 | 0.1354 |
| concavity_se | 569.0 | NaN | NaN | NaN | 0.031894 | 0.030186 | 0.0 | 0.01509 | 0.02589 | 0.04205 | 0.396 |
| concave points_se | 569.0 | NaN | NaN | NaN | 0.011796 | 0.00617 | 0.0 | 0.007638 | 0.01093 | 0.01471 | 0.05279 |
| symmetry_se | 569.0 | NaN | NaN | NaN | 0.020542 | 0.008266 | 0.007882 | 0.01516 | 0.01873 | 0.02348 | 0.07895 |
| fractal_dimension_se | 569.0 | NaN | NaN | NaN | 0.003795 | 0.002646 | 0.000895 | 0.002248 | 0.003187 | 0.004558 | 0.02984 |
| radius_worst | 569.0 | NaN | NaN | NaN | 16.26919 | 4.833242 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 |
| texture_worst | 569.0 | NaN | NaN | NaN | 25.677223 | 6.146258 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 |
| perimeter_worst | 569.0 | NaN | NaN | NaN | 107.261213 | 33.602542 | 50.41 | 84.11 | 97.66 | 125.4 | 251.2 |
| area_worst | 569.0 | NaN | NaN | NaN | 880.583128 | 569.356993 | 185.2 | 515.3 | 686.5 | 1084.0 | 4254.0 |
| smoothness_worst | 569.0 | NaN | NaN | NaN | 0.132369 | 0.022832 | 0.07117 | 0.1166 | 0.1313 | 0.146 | 0.2226 |
| compactness_worst | 569.0 | NaN | NaN | NaN | 0.254265 | 0.157336 | 0.02729 | 0.1472 | 0.2119 | 0.3391 | 1.058 |
| concavity_worst | 569.0 | NaN | NaN | NaN | 0.272188 | 0.208624 | 0.0 | 0.1145 | 0.2267 | 0.3829 | 1.252 |
| concave points_worst | 569.0 | NaN | NaN | NaN | 0.114606 | 0.065732 | 0.0 | 0.06493 | 0.09993 | 0.1614 | 0.291 |
| symmetry_worst | 569.0 | NaN | NaN | NaN | 0.290076 | 0.061867 | 0.1565 | 0.2504 | 0.2822 | 0.3179 | 0.6638 |
| fractal_dimension_worst | 569.0 | NaN | NaN | NaN | 0.083946 | 0.018061 | 0.05504 | 0.07146 | 0.08004 | 0.09208 | 0.2075 |
breast_dataset.describe(include="O").T
| | count | unique | top | freq |
|---|---|---|---|---|
| diagnosis | 569 | 2 | B | 357 |
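From this summary, 357 of the 569 samples are benign, which leaves 212 malignant. The classes are moderately imbalanced, which is worth keeping in mind when reading accuracy figures. The arithmetic, as a quick check:

```python
total = 569    # rows in the dataset
benign = 357   # freq of the top class "B" from describe()
malignant = total - benign

benign_share = benign / total
malignant_share = malignant / total

print(malignant)  # 212
print(f"{benign_share:.1%} benign, {malignant_share:.1%} malignant")
# 62.7% benign, 37.3% malignant
```

A classifier that always predicts "benign" would therefore already reach about 62.7% accuracy, which makes a useful baseline for judging the SVM.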
Visualizations:
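# Drop the id column (an identifier, not a measurement) and color points by diagnosis.
# Note: with 30 features this grid is large and slow to render.
sns.pairplot(breast_dataset.drop(columns=["id"]), hue="diagnosis")
plt.show()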